Improving Language Modeling by Combining Heterogeneous Corpora

Authors

  • Zheng-Yu ZHOU
  • Jian-Feng GAO
  • Eric CHANG
Abstract

In statistical language modeling, directly adding training data (e.g., data collected from websites) does not always improve the performance of language models, because the data may not be suitable for the application or may contain errors. This paper presents a method of combining multiple heterogeneous corpora to improve the resulting language models, called the compressed context-dependent interpolation scheme. The basic idea behind our method is that we not only want to filter for good data, but also want to weight it against all the other training data, giving greater emphasis to data that better matches real usage scenarios or better balances the overall training set. Improvements in the accuracy of phone-to-character conversion have been observed in our experiments.
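The abstract does not spell out the interpolation mechanics, but the core idea of context-dependent interpolation can be sketched as a mixture of corpus-specific n-gram models whose weights depend on both the corpus and the context. The function below is a minimal, illustrative stand-in, not the paper's actual compressed scheme; the dictionary shapes and the `weights` keying are assumptions for the sketch.

```python
from collections import defaultdict

def interpolate(models, weights):
    """Context-dependent linear interpolation of corpus-specific LMs.

    models:  corpus name -> {(history, word): P(word | history)}
    weights: (corpus, history) -> lambda, so the mixture can emphasize
             corpora whose data better matches a given context.
    Returns the combined {(history, word): probability} table.
    """
    combined = defaultdict(float)
    for corpus, probs in models.items():
        for (history, word), p in probs.items():
            combined[(history, word)] += weights[(corpus, history)] * p
    return dict(combined)
```

With two toy corpora and equal weights for the history `("the",)`, the combined probability of "cat" is simply the average of the two component probabilities; unequal weights would shift mass toward the better-matched corpus.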


Related papers

Improving language model perplexity and recognition accuracy for medical dictations via within-domain interpolation with literal and semi-literal corpora

We propose a technique for improving language modeling for automated speech recognition of medical dictations by interpolating finished text (25M words) with small human-generated literal and/or machine-generated semi-literal corpora. By building and testing interpolated (ILM) models with literal (LILM), semi-literal (SILM) and partial (PILM) corpora, we show that both perplexity and recognition results ...


Improving Question Answering for Reading Comprehension Tests by Combining Multiple Systems

Most work on reading comprehension question answering systems has focused on improving performance by adding complex natural language processing (NLP) components to such systems rather than by combining the output of multiple systems. Our paper empirically evaluates whether combining the outputs of seven such systems submitted as the final projects for a graduate level class can improve over th...


Experiments on Processing Overlapping Parallel Corpora

The number and sizes of parallel corpora keep growing, which makes it necessary to have automatic methods of processing them: combining, checking and improving corpora quality, etc. We here introduce a method which enables performing many of these by exploiting overlapping parallel corpora. The method finds the correspondence between sentence pairs in two corpora: first the corresponding langua...


Improving Persian-English Statistical Machine Translation: Experiments in Domain Adaptation

This paper documents recent work carried out for PeEn-SMT, our Statistical Machine Translation system for translation between the English-Persian language pair. We give details of our previous SMT system, and present our current development of significantly larger corpora. We explain how recent tests using much larger corpora helped to evaluate problems in parallel corpus alignment, corpus cont...


Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation

In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora. We introduce a modification of the exchange clustering algorithm with improved efficiency for certain partially clas...




Publication date: 2002